Abstract

People believe that depth plays an important role in the success of deep neural networks (DNN). However, this belief lacks solid theoretical justification as far as we know. We investigate the role of depth from the perspective of the margin bound. In the margin bound, the expected error is upper bounded by the empirical margin error plus a Rademacher Average (RA) based capacity term. First, we derive an upper bound for the RA of DNN, and show that it increases with increasing depth. This indicates a negative impact of depth on test performance. Second, we show that deeper networks tend to have larger representation power (measured by Betti numbers based complexity) than shallower networks in the multi-class setting, and thus can lead to smaller empirical margin error. This implies a positive impact of depth. The combination of these two results shows that for DNN with a restricted number of hidden units, increasing depth is not always good since there is a tradeoff between positive and negative impacts. These results inspire us to seek alternative ways to achieve the positive impact of depth, e.g., imposing margin-based penalty terms on the cross entropy loss so as to reduce the empirical margin error without increasing depth. Our experiments show that in this way, we achieve significantly better test performance.


1 Introduction 

Deep neural networks (DNN) have achieved great practical success in many machine learning tasks, such as speech recognition, image classification, and natural language processing (Hinton and Salakhutdinov 2006; Krizhevsky, Sutskever, and Hinton 2012; Hinton et al. 2012a; Ciresan, Meier, and Schmidhuber 2012; Weston et al. 2012). Many people believe that depth plays an important role in the success of DNN (Srivastava, Greff, and Schmidhuber 2015; Simonyan and Zisserman 2014; Lee et al. 2014; Romero et al. 2014; He et al. 2015; Szegedy et al. 2014). However, as far as we know, this belief still lacks solid theoretical justification.

On one hand, some researchers have tried to understand the role of depth in DNN by investigating its generalization bound. For example, in (Bartlett, Maiorov, and Meir 1998; Karpinski and Macintyre 1995; Goldberg and Jerrum 1995), generalization bounds for multi-layer neural networks were derived based on the Vapnik-Chervonenkis (VC) dimension. In (Bartlett 1998; Koltchinskii and Panchenko 2002), a margin bound was given for fully connected neural networks in the setting of binary classification. In (Neyshabur, Tomioka, and Srebro 2015), the capacity of different norm-constrained feed-forward networks was investigated. While these works shed some light on the theoretical properties of DNN, they have limitations in helping us understand the role of depth, for the following reasons. First, the number of parameters in many practical DNN models can be very large, sometimes even larger than the size of the training data. This makes the VC dimension based generalization bound too loose to use. Second, practical DNN are usually used to perform multi-class classification and often contain many convolutional layers, such as the models used for ImageNet (Deng et al. 2009). However, most existing bounds only concern binary classification and fully connected networks. Therefore, these bounds cannot be used to explain the advantage of using deep neural networks.


On the other hand, in recent years, researchers have tried to explain the role of depth from other angles, e.g., deeper neural networks are able to represent more complex functions. In (Hastad 1986; Delalleau and Bengio 2011), the authors showed that there exist families of functions that can be represented much more efficiently with a deep logic circuit or sum-product network than with a shallow one, i.e., with substantially fewer hidden units. In (Bianchini and Scarselli 2014; Montufar et al. 2014), it was demonstrated that deeper nets can represent more complex functions than shallower nets in terms of the maximal number of linear regions and Betti numbers. However, these works do not address the generalization of the learning process, and thus they cannot be used to explain the test performance improvement of DNN.


To reveal the role of depth in DNN, in this paper, we propose to investigate the margin bound of DNN. According to the margin bound, the expected 0-1 error of a DNN model is upper bounded by the empirical margin error plus a Rademacher Average (RA) based capacity term. First, we derive an upper bound for the RA-based capacity term, for both fully connected and convolutional neural networks in the multi-class setting. We find that this upper bound of RA increases with depth, which indicates that depth has a negative impact on the test performance of DNN. Second, for the empirical margin error, we study the representation power of deeper networks, because if a deeper net can produce more complex classifiers, it will be able to fit the training data better w.r.t. any margin coefficient. Specifically, we measure the representation power of a DNN model using the Betti numbers based complexity (Bianchini and Scarselli 2014), and show that, in the multi-class setting, the Betti numbers based complexity of deeper nets is indeed much larger than that of shallower nets. This, on the other hand, implies a positive impact of depth on the test performance of DNN. By combining these two results, we conclude that for DNN with a restricted number of hidden units, arbitrarily increasing the depth is not always good since there is a clear tradeoff between its positive and negative impacts. In other words, with increasing depth, the test error of DNN may first decrease, and then increase. This pattern of test error has been validated by our empirical observations on different datasets.

The above theoretical findings also inspire us to look for alternative ways to achieve the positive impact of depth while avoiding its negative impact. For example, it seems feasible to add a margin-based penalty term to the cross entropy loss of DNN so as to directly reduce the empirical margin error on the training data, without increasing the RA of the DNN model. For ease of reference, we call the algorithm that minimizes the penalized cross entropy loss large margin DNN (LMDNN; see footnote 1). We have conducted extensive experiments on benchmark datasets to test the performance of LMDNN. The results show that LMDNN can achieve significantly better test performance than standard DNN. In addition, the models trained by LMDNN have smaller empirical margin error at almost all the margin coefficients, and thus their performance gains can be well explained by our derived theory.

The remaining part of this paper is organized as follows. In Section 2, we give some preliminaries for DNN. In Section 3, we investigate the roles of depth in RA and empirical margin error respectively. In Section 4, we propose the large margin DNN algorithms and conduct experiments to test their performance. In Section 5, we conclude the paper and discuss some future work.


1 One related work is (Li et al. 2015), which combines generative deep learning methods (e.g., RBM) with a margin-max posterior. In contrast, our approach aims to enlarge the margin of discriminative deep learning methods like DNN.


2 Preliminaries

Given a multi-class classification problem, we denote $\mathcal{X} = \mathbb{R}^d$ as the input space, $\mathcal{Y} = \{1, \dots, K\}$ as the output space, and $P$ as the joint distribution over $\mathcal{X} \times \mathcal{Y}$. Here $d$ denotes the dimension of the input space, and $K$ denotes the number of categories in the output space. We have a training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, which is sampled i.i.d. from $\mathcal{X} \times \mathcal{Y}$ according to distribution $P$. The goal is to learn a prediction model $f \in \mathcal{F}: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ from the training set, which produces an output vector $(f(x, k); k \in \mathcal{Y})$ for each instance $x \in \mathcal{X}$ indicating its likelihood of belonging to category $k$. The final classification is then determined by $\arg\max_{k \in \mathcal{Y}} f(x, k)$. This naturally leads to the following definition of the margin $\rho(f; x, y)$ of the model $f$ at a labeled sample $(x, y)$:

$$\rho(f; x, y) = f(x, y) - \max_{k \neq y} f(x, k). \quad (1)$$

The classification accuracy of the prediction model $f$ is measured by its expected 0-1 error, i.e.,

$$err_P(f) = \Pr_{(x,y) \sim P} I[\arg\max_{k \in \mathcal{Y}} f(x, k) \neq y] \quad (2)$$

$$= \Pr_{(x,y) \sim P} I[\rho(f; x, y) < 0], \quad (3)$$

where $I[\cdot]$ is the indicator function.

We call the 0-1 error on the training set the training error and that on the test set the test error. Since the expected 0-1 error cannot be obtained due to the unknown distribution $P$, one usually uses the test error as its proxy when examining the classification accuracy.

Now, we consider using neural networks to fulfill the multi-class classification task. Suppose there are $L$ layers in a neural network, including $L - 1$ hidden layers and an output layer. There are $n_l$ units in layer $l$ ($l = 1, \dots, L$). The number of units in the output layer is fixed by the classification problem, i.e., $n_L = K$. There are weights associated with the edges between units in adjacent layers of the neural network. To avoid overfitting, one usually constrains the size of the weights, e.g., imposes a constraint $A$ on the sum of the absolute values of the weights for each unit. We give a unified formulation for both fully connected and convolutional neural networks. Mathematically, we denote the function space of multi-layer neural networks with depth $L$ and weight constraint $A$ as $\mathcal{F}^L_A$, i.e.,

$$\mathcal{F}^L_A = \Big\{ (x, k) \to \sum_{i=1}^{n_{L-1}} w_i f_i(x);\ f_i \in \mathcal{F}^{L-1}_A,\ \sum_{i=1}^{n_{L-1}} |w_i| \le A,\ w_i \in \mathbb{R} \Big\}; \quad (4)$$

for $l = 1, \dots, L - 1$,

$$\mathcal{F}^l_A = \Big\{ x \to \varphi\big(\phi(f_1(x)), \dots, \phi(f_{p_l}(x))\big);\ f_1, \dots, f_{p_l} \in \bar{\mathcal{F}}^l_A \Big\}; \quad (5)$$

$$\bar{\mathcal{F}}^l_A = \Big\{ x \to \sum_{i=1}^{n_{l-1}} w_i f_i(x);\ f_i \in \mathcal{F}^{l-1}_A,\ \sum_{i=1}^{n_{l-1}} |w_i| \le A,\ w_i \in \mathbb{R} \Big\}; \quad (6)$$

and,

$$\mathcal{F}^0_A = \big\{ x \to x|_i;\ i \in \{1, \dots, d\} \big\}; \quad (7)$$

where $w_i$ denotes the weight in the neural network, and $x|_i$ is the $i$-th dimension of input $x$. The functions $\varphi$ and $\phi$ are defined as follows:

(1) If the $l$-th layer is a convolutional layer, the outputs of the $(l-1)$-th layer are mapped to the $l$-th layer by means of filter, activation, and then pooling. That is, in Eqn (6), many weights equal 0, and $n_l$ is determined by $n_{l-1}$ as well as the number and domain size of the filters. In Eqn (5), $p_l$ equals the size of the pooling region in the $l$-th layer, and the function $\varphi: \mathbb{R}^{p_l} \to \mathbb{R}$ is called the pooling function. Widely used pooling functions include max-pooling $\max(t_1, \dots, t_{p_l})$ and average-pooling $(t_1 + \dots + t_{p_l}) / p_l$. The function $\phi$ is increasing and usually called the activation function. Widely used activation functions include the standard sigmoid function $\phi(t) = \frac{1}{1 + e^{-t}}$, the tanh function $\phi(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}$, and the rectifier function $\phi(t) = \max(0, t)$. Please note that all these activation functions are 1-Lipschitz.

(2) If the $l$-th layer is a fully connected layer, the outputs of the $(l-1)$-th layer are mapped to the $l$-th layer by linear combination and subsequently activation. That is, in Eqn (5), $p_l = 1$ and $\varphi(x) = x$.
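The layer recursion described in items (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`pooled_unit`, `within_l1_budget`, `identity_pool`) are hypothetical, inputs are plain lists, and the weight constraint is read as an l1 bound on the incoming weights of each unit.

```python
import math

def sigmoid(t):
    """Standard sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-t))

def relu(t):
    """Rectifier activation."""
    return max(0.0, t)

def max_pool(ts):
    return max(ts)

def avg_pool(ts):
    return sum(ts) / len(ts)

def identity_pool(ts):
    """Fully connected case: pooling region of size 1, pooling is the identity."""
    assert len(ts) == 1
    return ts[0]

def within_l1_budget(weights, A):
    """Check the weight constraint: sum of |w_i| for one unit is at most A."""
    return sum(abs(w) for w in weights) <= A

def linear_unit(weights, inputs):
    """Linear combination of the previous layer's outputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def pooled_unit(weight_rows, inputs, activation, pool):
    """One unit of layer l: activate p_l linear combinations, then pool them."""
    activated = [activation(linear_unit(w, inputs)) for w in weight_rows]
    return pool(activated)
```

A fully connected unit corresponds to a single weight row with `identity_pool`; a convolutional unit uses several (typically sparse) rows with `max_pool` or `avg_pool`.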

Because the distribution $P$ is unknown and the 0-1 error is non-continuous, a common way of learning the weights in the neural network is to minimize an empirical (surrogate) loss function. A widely used loss function is the cross entropy loss, which is defined as follows,

$$C(f; x, y) = -\sum_{k=1}^{K} z_k \ln \sigma(x, k), \quad (8)$$

where $z_k = 1$ if $k = y$, and $z_k = 0$ otherwise. Here $\sigma(x, k) = \frac{\exp(f(x, k))}{\sum_{j=1}^{K} \exp(f(x, j))}$ is the softmax operation that normalizes the outputs of the neural network to a distribution.

The back-propagation algorithm is usually employed to minimize the loss function, in which the weights are updated by means of stochastic gradient descent (SGD).
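As a concrete illustration of the quantities defined in this section, the following sketch computes the softmax, the cross entropy loss, the margin, and the 0-1 error for a single sample. It is a minimal sketch, assuming class labels are 0-indexed and the network outputs $f(x, \cdot)$ are given as a plain list; the function names are illustrative and not part of the paper.

```python
import math

def softmax(scores):
    """Normalize network outputs f(x, .) to a distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(scores, y):
    """Cross entropy loss: only the true class y has z_k = 1."""
    return -math.log(softmax(scores)[y])

def margin(outputs, y):
    """Margin: output for the true class minus the best wrong class."""
    return outputs[y] - max(o for k, o in enumerate(outputs) if k != y)

def zero_one_error(outputs, y):
    """1 if the arg max prediction differs from y, else 0."""
    prediction = max(range(len(outputs)), key=lambda k: outputs[k])
    return int(prediction != y)
```

For outputs `[2.0, 1.0, 0.0]` the margin is `1.0` when the true class is 0 and `-2.0` when it is 2, so a negative margin coincides with a misclassification.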

3 The Role of Depth in Deep Neural 
Networks 

In this section, we analyze the role of depth in DNN from the perspective of the margin bound. For this purpose, we first give the definitions of empirical margin error and Rademacher Average (RA), and then introduce the margin bound for multi-class classification.

Definition 1. Suppose $f \in \mathcal{F}: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a multi-class prediction model. For $\forall \gamma > 0$, the empirical margin error of $f$ at margin coefficient $\gamma$ is defined as follows:

$$err^{\gamma}_S(f) = \frac{1}{m} \sum_{i=1}^{m} I[\rho(f; x_i, y_i) \le \gamma]. \quad (9)$$

Definition 2. Suppose $\mathcal{F}: \mathcal{X} \to \mathbb{R}$ is a model space with a single dimensional output. The Rademacher average (RA) of $\mathcal{F}$ is defined as follows:

$$R_m(\mathcal{F}) = E_{x, \sigma} \Big[ \sup_{f \in \mathcal{F}} \Big| \frac{2}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \Big| \Big], \quad (10)$$

where $x = \{x_1, \dots, x_m\} \sim P_x^m$, and $\{\sigma_1, \dots, \sigma_m\}$ are i.i.d. sampled with $P(\sigma_i = 1) = 1/2$, $P(\sigma_i = -1) = 1/2$.

Theorem 1. (Koltchinskii and Panchenko 2002) Suppose $f \in \mathcal{F}: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a multi-class prediction model. For $\forall \delta > 0$, with probability at least $1 - \delta$, we have, $\forall f \in \mathcal{F}$,

$$err_P(f) \le \inf_{\gamma > 0} \Big\{ err^{\gamma}_S(f) + \frac{8K(2K-1)}{\gamma} R_m(\tilde{\mathcal{F}}) + \sqrt{\frac{\log \log_2 (2\gamma^{-1})}{m}} + \sqrt{\frac{\log (2\delta^{-1})}{2m}} \Big\}, \quad (11)$$

where $\tilde{\mathcal{F}} = \{x \to f(\cdot, k);\ k \in \mathcal{Y}, f \in \mathcal{F}\}$.

According to the margin bound given in Theorem 1, the expected 0-1 error of a DNN model can be upper bounded by the sum of two terms, RA and the empirical margin error. In the next two subsections, we discuss the role of depth in these two terms, respectively.


3.1 Rademacher Average

In this subsection, we study the role of depth in the RA-based capacity term.

In the following theorem, we derive a uniform upper bound of RA for both fully connected and convolutional neural networks (see footnote 2).

Theorem 2. Suppose the input space is $\mathcal{X} = [-M, M]^d$. In the deep neural networks, if the activation function $\phi$ is $L_\phi$-Lipschitz and non-negative, the pooling function $\varphi$ is max-pooling or average-pooling, and the size of the pooling region in each layer is bounded, i.e., $p_l \le p$, then we have,

$$R_m(\tilde{\mathcal{F}}^L_A) \le cM \sqrt{\frac{\ln d}{m}} (2 p L_\phi A)^L, \quad (12)$$

where $c$ is a constant.

Proof. According to the definition of $\tilde{\mathcal{F}}^L_A$ and RA, we have,

$$R_m(\tilde{\mathcal{F}}^L_A) = E_{x, \sigma} \Big[ \sup_{\|w\|_1 \le A,\ f_j \in \mathcal{F}^{L-1}_A} \Big| \frac{2}{m} \sum_{i=1}^{m} \sigma_i \sum_{j=1}^{n_{L-1}} w_j f_j(x_i) \Big| \Big]$$

$$= E_{x, \sigma} \Big[ \sup_{\|w\|_1 \le A,\ f_j \in \mathcal{F}^{L-1}_A} \Big| \frac{2}{m} \sum_{j=1}^{n_{L-1}} w_j \sum_{i=1}^{m} \sigma_i f_j(x_i) \Big| \Big].$$

Supposing $w = \{w_1, \dots, w_{n_{L-1}}\}$ and $h = \{\sum_{i=1}^{m} \sigma_i f_1(x_i), \dots, \sum_{i=1}^{m} \sigma_i f_{n_{L-1}}(x_i)\}$, the inner product $\langle w, h \rangle$ is maximized when $w$ is at one of the extreme points of the $l_1$ ball, which implies:

$$R_m(\tilde{\mathcal{F}}^L_A) \le A\, E_{x, \sigma} \Big[ \sup_{f \in \mathcal{F}^{L-1}_A} \Big| \frac{2}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \Big| \Big] = A\, R_m(\mathcal{F}^{L-1}_A). \quad (13)$$


For the function class $\mathcal{F}^{L-1}_A$, if the $(L-1)$-th layer is a fully connected layer, it is clear that $R_m(\mathcal{F}^{L-1}_A) \le R_m(\phi \circ \bar{\mathcal{F}}^{L-1}_A)$ holds. If the $(L-1)$-th layer is a convolutional layer with max-pooling or average-pooling, we have,

$$R_m(\mathcal{F}^{L-1}_A) \le E_{x, \sigma} \Big[ \sup_{f_1, \dots, f_{p_{L-1}} \in \bar{\mathcal{F}}^{L-1}_A} \Big| \frac{2}{m} \sum_{i=1}^{m} \sigma_i \sum_{j=1}^{p_{L-1}} \phi(f_j(x_i)) \Big| \Big] = p_{L-1} R_m(\phi \circ \bar{\mathcal{F}}^{L-1}_A). \quad (14)$$

The inequality (14) holds due to the fact that most widely used activation functions $\phi$ (e.g., standard sigmoid and rectifier) have non-negative outputs.

Therefore, for both fully connected layers and convolutional layers, $R_m(\mathcal{F}^{L-1}_A) \le p_{L-1} R_m(\phi \circ \bar{\mathcal{F}}^{L-1}_A)$ uniformly holds. Further considering the Lipschitz property of $\phi$, we have,

$$R_m(\mathcal{F}^{L-1}_A) \le 2 p_{L-1} L_\phi R_m(\bar{\mathcal{F}}^{L-1}_A). \quad (15)$$

Iteratively using the maximization principle of the inner product in (13), the property of RA in (14), and the Lipschitz property in (15), and considering $p_l \le p$, we obtain the following inequality:

$$R_m(\tilde{\mathcal{F}}^L_A) \le (2 p L_\phi A)^{L-1} R_m(\mathcal{F}^1_A). \quad (16)$$

According to (Bartlett and Mendelson 2003), $R_m(\mathcal{F}^1_A)$ can be bounded by:

$$R_m(\mathcal{F}^1_A) \le cAM \sqrt{\frac{\ln d}{m}}, \quad (17)$$

where $c$ is a constant.

Combining (16) and (17), we obtain the upper bound on the RA of DNN. □

2 To the best of our knowledge, an upper bound of RA for fully connected neural networks has been derived before (Bartlett and Mendelson 2003; Neyshabur, Tomioka, and Srebro 2015), but there is no result available for convolutional neural networks.
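To make the Rademacher average of Definition 2 more tangible, the sketch below estimates it by Monte Carlo for a small finite function class. It is an illustration only and rests on two simplifying assumptions that are not in the paper: the sample is held fixed (so only the expectation over the random signs is approximated), and the supremum is taken over an explicit list of functions. The name `empirical_rademacher` is hypothetical.

```python
import random

def empirical_rademacher(functions, xs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_f |(2/m) sum_i sigma_i f(x_i)|
    for a finite function class on a fixed sample xs."""
    rng = random.Random(seed)
    m = len(xs)
    # Precompute f(x_i) for every function in the class.
    values = [[f(x) for x in xs] for f in functions]
    total = 0.0
    for _ in range(n_draws):
        # Rademacher signs: +1 or -1 with probability 1/2 each.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        total += max(
            abs(2.0 / m * sum(s * v for s, v in zip(sigma, row)))
            for row in values
        )
    return total / n_draws
```

Enlarging the function class can only increase the supremum for each sign draw, which is the sense in which RA measures capacity.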


From the above theorem, we can see that as the depth increases, the upper bound of RA increases, and thus the margin bound becomes looser. This indicates that depth has a negative impact on the test performance of DNN.
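The dependence of the RA bound on depth can be checked numerically. The sketch below evaluates the product obtained by combining (16) and (17), i.e. $cAM\sqrt{\ln d / m}\,(2 p L_\phi A)^{L-1}$. The constant $c$ is not specified in the text, so it is set to 1 here as an assumption, and the function name and the example values are purely illustrative.

```python
import math

def ra_upper_bound(depth, weight_bound, input_bound, input_dim, n_samples,
                   pool_size=1, lipschitz=1.0, c=1.0):
    """Evaluate c*A*M*sqrt(ln d / m) * (2*p*L_phi*A)**(L-1).

    c = 1.0 is an assumed placeholder for the unspecified constant.
    """
    per_layer_factor = 2.0 * pool_size * lipschitz * weight_bound
    first_layer = (c * weight_bound * input_bound
                   * math.sqrt(math.log(input_dim) / n_samples))
    return per_layer_factor ** (depth - 1) * first_layer
```

Each additional layer multiplies the bound by $2 p L_\phi A$, so the bound grows geometrically with depth whenever this factor exceeds 1.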

3.2 Empirical Margin Error 

In this subsection, we study the role of depth in the empirical margin error.

To this end, we first discuss the representation power of DNN models. In particular, we use the Betti numbers based complexity (Bianchini and Scarselli 2014) to measure the representation power. We generalize the definition of Betti numbers based complexity to the multi-class setting as follows.

Definition 3. The Betti numbers based complexity of functions implemented by multi-class neural networks $\mathcal{F}^L_A$ is defined as $N(\mathcal{F}^L_A) = \sum_{i=1}^{K-1} B(S_i)$, where $B(S_i)$ is the sum of Betti numbers (see footnote 3) that measures the complexity of the set $S_i$. Here $S_i = \bigcap_{j \neq i} \{x \in \mathbb{R}^d \mid f(x, i) - f(x, j) > 0\}$.

As can be seen from the above definition, the Betti numbers based complexity considers the classification output and merges those regions corresponding to the same classification output (thus it is more accurate than the linear region number complexity (Montufar et al. 2014) in measuring the representation power). As far as we know, bounds on the Betti numbers based complexity have been derived only for binary classification and fully connected networks (Bianchini and Scarselli 2014), and there is no result for the setting of multi-class classification and convolutional networks. In the following, we give our own theorem to fill this gap.

Theorem 3. Consider neural networks $\mathcal{F}^L_A$ that have $h$ hidden units. If the activation function $\phi$ is a Pfaffian function with complexity $(\alpha, \beta, \eta)$, the pooling function $\varphi$ is average-pooling, and $d \le h\eta$, then

$$N(\mathcal{F}^L_A) \le (K-1)^{d+1}\, 2^{h\eta(h\eta - 1)/2} \times O\Big( \big( d \big( (\alpha + \beta - 1 + \alpha\beta)(L-1) + \beta(\alpha + 1) \big) \big)^{d + h\eta} \Big). \quad (18)$$

3 For any subset $S \subset \mathbb{R}^d$, there exist $d$ Betti numbers, denoted as $b_j(S)$, $0 \le j \le d - 1$. Therefore, the sum of Betti numbers is denoted as $B(S) = \sum_{j=0}^{d-1} b_j(S)$. Intuitively, the first Betti number $b_0(S)$ is the number of connected components of the set $S$, while the $j$-th Betti number $b_j(S)$ counts the number of $(j+1)$-dimension holes in $S$ (Bianchini and Scarselli 2014).

Proof. We first show that the functions $f(x, \cdot) \in \mathcal{F}^L_A$ are Pfaffian functions with complexity $((\alpha + \beta - 1 + \alpha\beta)(L-1) + \alpha\beta, \beta, h\eta)$, where $\mathcal{F}^L_A$ can contain both fully connected layers and convolutional layers. Assume the Pfaffian chain which defines the activation function $\phi(t)$ is $(\phi_1(t), \dots, \phi_\eta(t))$. The chain $s^l$ is then constructed by applying all $\phi_i$, $1 \le i \le \eta$, to all the neurons up to layer $l - 1$, i.e., $f^l \in \bar{\mathcal{F}}^l_A$, $l \in \{1, \dots, L-1\}$. As the first step, we need to get the degree of $f^l$ in the chain $s^l$. Since $f^l$ is a weighted sum of average-pooled activations $\phi(\cdot)$ of the previous layer and $\phi$ is a Pfaffian function, $f^l$ is a polynomial of degree $\beta$ in the chain $s^l$. Then, it remains to show that the derivative of each function in $s^l$, i.e., $\frac{\partial \phi_i(f^l)}{\partial x|_k} = \frac{d \phi_i(f^l)}{d f^l} \frac{\partial f^l}{\partial x|_k}$, can be defined as a polynomial in the functions of the chain and the input. For average pooling, by iteratively using the chain rule, we obtain that the highest degree terms of this derivative are products, taken over the preceding layers, of derivatives of the activation function. Following Lemma 2 in (Bianchini and Scarselli 2014), we obtain the complexity of $f(x, \cdot) \in \mathcal{F}^L_A$.

Furthermore, the sum of two Pfaffian functions $f_1$ and $f_2$ defined by the same Pfaffian chain of length $\eta$ with complexity $(\alpha_1, \beta_1, \eta)$ and $(\alpha_2, \beta_2, \eta)$ respectively is a Pfaffian function with complexity $(\max(\alpha_1, \alpha_2), \max(\beta_1, \beta_2), \eta)$ (Gabrielov and Vorobjov 2004). Therefore, $f(x, i) - f(x, j)$, $i \neq j$, is a Pfaffian function with complexity $((\alpha + \beta - 1 + \alpha\beta)(L-1) + \alpha\beta, \beta, h\eta)$.

According to (Zell 1999), since $S_i$ is defined by $K - 1$ sign conditions (inequalities or equalities) on Pfaffian functions, and all the functions defining $S_i$ have complexity at most $((\alpha + \beta - 1 + \alpha\beta)(L-1) + \alpha\beta, \beta, h\eta)$, $B(S_i)$ can be upper bounded by $(K-1)^d\, 2^{h\eta(h\eta-1)/2} \times O\big( ( d ( (\alpha + \beta - 1 + \alpha\beta)(L-1) + \beta(\alpha+1) ) )^{d + h\eta} \big)$.

Summing over all $i \in \{1, \dots, K-1\}$, we get the result stated in Theorem 3. □


Theorem 3 upper bounds the Betti numbers based complexity for general activation functions. For specific activation functions, we can get the following results: when $\phi = \arctan(\cdot)$ and $d \le 2h$, since arctan is of complexity $(3, 1, 2)$, we have $N(\mathcal{F}^L_A) \le (K-1)^{d+1}\, 2^{h(2h-1)}\, O((d(L-1) + d)^{d+2h})$; when $\phi = \tanh(\cdot)$ and $d \le h$, since tanh is of complexity $(2, 1, 1)$, we have $N(\mathcal{F}^L_A) \le (K-1)^{d+1}\, 2^{h(h-1)/2}\, O((d(L-1) + d)^{d+h})$.

Basically, Theorem 3 indicates that in the multi-class setting, the Betti numbers based complexity grows with increasing depth $L$. As a result, deeper nets will have larger representation power than shallower nets, which makes deeper nets fit the training data better and achieve smaller empirical margin error. This indicates that depth has a positive impact on the test performance of DNN.
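To illustrate what the Betti numbers based complexity counts, consider the one-dimensional case $d = 1$, where only $b_0$ (the number of connected components) exists. The sketch below approximates $b_0(S_i)$ on a grid by counting maximal runs of points where class $i$ strictly beats every other class. This is a grid-based approximation written for illustration, not the construction used in the proof; the function names and the toy score function in the usage note are hypothetical.

```python
import math

def winning_class(scores):
    """Index i with f(x, i) - f(x, j) > 0 for all j != i, or None on ties."""
    best = max(scores)
    winners = [k for k, s in enumerate(scores) if s == best]
    return winners[0] if len(winners) == 1 else None

def count_components_1d(score_fn, cls, lo, hi, n_points=601):
    """Approximate b_0 of S_cls on [lo, hi] by counting runs on a uniform grid."""
    count = 0
    inside = False
    for i in range(n_points):
        x = lo + (hi - lo) * i / (n_points - 1)
        in_set = winning_class(score_fn(x)) == cls
        if in_set and not inside:
            count += 1  # a new connected component starts here
        inside = in_set
    return count
```

For the toy two-class score function `lambda x: [math.sin(3 * x), 0.0]` on `[0, 6]`, each class wins on three disjoint intervals, so a more oscillating score function yields a larger count.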
Figure 1: The influence of depth on empirical margin error. (a) MNIST; (b) CIFAR-10.

Figure 2: The influence of depth on test error. (a) MNIST; (b) CIFAR-10.


The above discussion of the impact of depth on representation power is consistent with our empirical findings. We conducted experiments on two datasets, MNIST (LeCun et al. 1998) and CIFAR-10 (Krizhevsky 2009). To investigate the influence of network depth $L$, we trained fully connected DNN with different depths and a restricted number of hidden units. The experimental results are shown in Figure 1 and indicate that on both datasets, deeper networks have smaller empirical margin errors than shallower networks for most of the margin coefficients.
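The empirical margin error curves discussed above can be computed directly from a model's outputs. The sketch below follows Definition 1 and assumes the non-strict inequality (margin at most $\gamma$), 0-indexed labels, and outputs given as lists; the function names and the tiny data set in the usage note are hypothetical.

```python
def margin(outputs, y):
    """Output for the true class minus the largest output among wrong classes."""
    return outputs[y] - max(o for k, o in enumerate(outputs) if k != y)

def empirical_margin_error(all_outputs, labels, gamma):
    """Fraction of training samples whose margin is at most gamma."""
    m = len(labels)
    hits = sum(1 for o, y in zip(all_outputs, labels) if margin(o, y) <= gamma)
    return hits / m

def margin_error_curve(all_outputs, labels, gammas):
    """Empirical margin error evaluated at several margin coefficients."""
    return [empirical_margin_error(all_outputs, labels, g) for g in gammas]
```

With outputs `[[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.3, 0.3, 0.4]]` and labels `[0, 0, 2]`, the margins are roughly 0.5, -0.1 and 0.1, so the curve is non-decreasing in the margin coefficient and reaches 1 once it exceeds the largest margin.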


3.3 Discussions 

Based on the discussions in the previous two subsections, we can see that when the depth $L$ of DNN increases, (1) the RA term in the margin bound will increase (according to Theorem 2); (2) the empirical margin error in the margin bound will decrease since deeper nets have larger representation power (according to Theorem 3). As a consequence, we conclude that, for DNN with a restricted number of hidden units, arbitrarily increasing depth is not always good since there is a clear tradeoff between its positive and negative impacts on test error. In other words, with increasing depth, the test error of DNN may first decrease, and then increase.

This theoretical pattern is consistent with our empirical observations on different datasets. We used the same experimental setting as that in Subsection 3.2 and repeated the training of DNN (with different random initializations) 5 times. Figure 2 reports the average and minimum test error of the 5 learned models. We can observe that as the depth increases, the test error first decreases (probably because the increased representation power outweighs the increased RA capacity), and then increases (probably because the RA capacity increases so quickly that the representation power cannot compensate for the negative impact of increased capacity).


4 Large Margin Deep Neural Networks 

From the discussions in Section 3, we can see that one may have to pay the cost of larger RA capacity when trying to obtain better representation power by increasing the depth of DNN (not to mention that the effective training of very deep neural networks is highly non-trivial (Glorot and Bengio 2010; Srivastava, Greff, and Schmidhuber 2015)). A natural question is then whether we can avoid this tradeoff, and achieve good test performance in an alternative way.

To this end, let us revisit the positive impact of depth: it lies in the fact that deeper neural networks tend to have larger representation power and thus smaller empirical margin error. Then the question is: can we directly minimize the empirical margin error? Our answer to this question is yes, and our proposal is to add a margin-based penalty term to the current loss function. In this way, we should be able to effectively tighten the margin bound without manipulating the depth.

One may argue that widely used loss functions (e.g., cross entropy loss and hinge loss) in DNN are convex surrogates of margin error by themselves, and it might be unnecessary to introduce an additional margin-based penalty term. However, we would like to point out that unlike hinge loss for SVM or exponential loss for AdaBoost, which have theoretical guarantees of convergence to margin-maximizing separators as the regularization vanishes (Rosset, Zhu, and Hastie 2003), there is no optimization consistency guarantee for these losses used in DNN since neural networks are highly non-convex. Therefore, it makes sense to explicitly add a margin-based penalty term to the loss function, in order to further reduce the empirical margin error during the training process.


4.1 Algorithm Description 

We propose adding two kinds of margin-based penalty terms to the original cross entropy loss (see footnote 4). The first penalty term is the gap between the upper bound of the margin (i.e., 1; see footnote 5) and the margin of the sample (i.e., $\rho(f; x, y)$). The second one is the average gap between the upper bound of the margin and the difference between the predicted output for the true category and those for all the wrong categories. It can be easily verified that the second penalty term is an upper bound of the first penalty term. Mathematically, the penalized loss functions can be described as follows (for ease of reference, we call them $C_1$ and $C_2$ respectively): for model $f$ and sample $(x, y)$,

$$C_1(f; x, y) = C(f; x, y) + \lambda \big( 1 - \rho(f; x, y) \big),$$

$$C_2(f; x, y) = C(f; x, y) + \frac{\lambda}{K-1} \sum_{k \neq y} \big( 1 - (f(x, y) - f(x, k)) \big).$$

4 Although we take the most widely used cross entropy loss as an example, these margin-based penalty terms can also be added to other loss functions.

5 Please note that, after the softmax operation, the outputs are normalized to [0, 1].































































                     MNIST            CIFAR-10
DNN-$C$ (%)          0.899 ± 0.038    18.339 ± 0.336
LMDNN-$C_1$ (%)      0.734 ± 0.046    17.598 ± 0.274
LMDNN-$C_2$ (%)      0.736 ± 0.041    17.728 ± 0.283

Table 1: Test error (%) of DNN-$C$ and LMDNNs.

Figure 3: Empirical margin error of LMDNNs.


We call the algorithms that minimize the above new loss functions large margin DNN algorithms (LMDNN). For ease of reference, we denote LMDNN minimizing $C_1$ and $C_2$ as LMDNN-$C_1$ and LMDNN-$C_2$ respectively, and the standard DNN algorithm minimizing $C$ as DNN-$C$. To train LMDNN, we also employ the back-propagation method.
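The two penalized losses can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the penalties are the plain (unsquared) gaps described in the prose above, that they are computed on the softmax outputs (which lie in [0, 1], so 1 is the upper bound of the margin), that the second penalty averages over the $K - 1$ wrong classes, and that labels are 0-indexed. The names `lmdnn_c1`, `lmdnn_c2`, and `lam` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(probs, y):
    return -math.log(probs[y])

def lmdnn_c1(scores, y, lam):
    """Cross entropy plus lam * (1 - margin), margin taken on softmax outputs."""
    probs = softmax(scores)
    rho = probs[y] - max(p for k, p in enumerate(probs) if k != y)
    return cross_entropy(probs, y) + lam * (1.0 - rho)

def lmdnn_c2(scores, y, lam):
    """Cross entropy plus lam times the average gap over all wrong classes."""
    probs = softmax(scores)
    gaps = [1.0 - (probs[y] - p) for k, p in enumerate(probs) if k != y]
    return cross_entropy(probs, y) + lam * sum(gaps) / len(gaps)
```

Setting `lam = 0` recovers the plain cross entropy loss, and for two classes the two penalties coincide because there is only one wrong class.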

4.2 Experimental Results 

Now we compare the performance of LMDNNs with DNN-$C$. We used the well-tuned network structures in the Caffe (Jia et al. 2014) tutorial (i.e., LeNet (see footnote 6) for MNIST and AlexNet (see footnote 7) for CIFAR-10), with all the hyper parameters tuned on the validation set.

Each model was trained 10 times with different initializations. Table 1 shows the mean and standard deviation of the test error over the 10 learned models for DNN-$C$ and LMDNNs after tuning the margin penalty coefficient $\lambda$. We can observe that, on both MNIST and CIFAR-10, LMDNNs achieve significant performance gains over DNN-$C$. In particular, LMDNN-$C_1$ reduces test error from 0.899% to 0.734% on MNIST and from 18.399% to 17.598% on CIFAR-10; LMDNN-$C_2$ reduces test error from 0.899% to 0.736% on MNIST and from 18.399% to 17.728% on CIFAR-10.

To further understand the effect of adding margin-based penalty terms, we plot the empirical margin errors of DNN-$C$ and LMDNNs in Figure 3. We can see that by introducing margin-based penalty terms, LMDNNs indeed achieve smaller empirical margin errors than DNN-$C$. Furthermore, the models with smaller empirical margin errors also have better test performance. For example, LMDNN-$C_1$ achieved both smaller empirical margin error and better test performance than LMDNN-$C_2$. This is consistent with Theorem 1 and in turn supports the reasonableness of our theorem.

Figure 4: Test error of LMDNNs with different $\lambda$. (a) MNIST; (b) CIFAR-10.

6 http://caffe.berkeleyvision.org/gathered/examples/mnist.html

7 http://caffe.berkeleyvision.org/gathered/examples/cifar10.html

We also report the mean test error of LMDNNs with different margin penalty coefficients $\lambda$ (see Figure 4). In the figure, we use a dashed line to represent the mean test error of DNN-$C$ (corresponding to $\lambda = 0$). From the figure, we can see that on both MNIST and CIFAR-10, (1) there is a range of $\lambda$ where LMDNNs outperform DNN-$C$; (2) although the best test performance of LMDNN-$C_2$ is not as good as that of LMDNN-$C_1$, the former has a broader range of $\lambda$ that can outperform DNN-$C$ in terms of the test error. This indicates the value of using LMDNN-$C_2$: it eases the tuning of the hyper parameter $\lambda$; (3) with increasing $\lambda$, the test error of LMDNNs will first decrease, and then increase. When $\lambda$ is in a reasonable range, LMDNNs can leverage both the good optimization property of cross entropy loss in the training process and the effectiveness of the margin-based penalty term, and thus achieve good test performance. When $\lambda$ becomes too large, the margin-based penalty term dominates the cross entropy loss. Considering that the margin-based penalty term may not have as good an optimization property as cross entropy loss in the training process, the degradation of test performance is understandable.

5 Conclusion and Future Work 

In this work, we have investigated the role of depth in DNN from the perspective of the margin bound. We find that while the RA term in the margin bound increases w.r.t. depth, the empirical margin error decreases instead. Therefore, arbitrarily increasing the depth might not always be good, since there is a tradeoff between the positive and negative impacts of depth on the test performance of DNN. Inspired by our theory, we propose two large margin DNN algorithms, which achieve significant performance gains over the standard DNN algorithm. In the future, we plan to study how other factors influence the test performance of DNN, such as unit allocations across layers and regularization tricks. We will also work on the design of effective algorithms that can further boost the performance of DNN.
Appendices

A Experiment Settings in Section 3.2

The MNIST dataset (for handwritten digit classification) consists of 28 × 28 black and white images, each containing a digit 0 to 9. There are 60k training examples and 10k test examples in this dataset. The CIFAR-10 dataset (for object recognition) consists of 32 × 32 RGB images, each containing an object, e.g., cat, dog, or ship. There are 50k training examples and 10k test examples in this dataset. For each dataset, we divide the 10k test examples into two subsets of equal size, one for validation and the other for testing. In each experiment, we use standard sigmoid activation in hidden layers and train neural networks by mini-batch SGD with momentum and weight decay. All the hyper-parameters are tuned on the validation set.

To investigate the influence of the network depth $L$, we train fully connected DNN models with different depths and a restricted number of hidden units. For simplicity and also following many previous works (e.g., Simard, Steinkraus, and Platt 2003), we assume that each hidden layer has the same number of nodes in the experiment. Specifically, for MNIST and CIFAR-10, the DNN models with depth 2, 3, 4, 5 and 6 respectively have 3000, 1500, 1000, 750 and 600 units in each hidden layer when the total number of hidden units is 3000.

B Experimental Settings in Section 4.2

For data pre-processing, we scale the pixel values in MNIST to [0, 1], and subtract the per-pixel mean computed over the training set from each image in CIFAR-10. On both datasets, we do not use data augmentation for simplicity.

For network structure, we used the well-tuned neural network structures as given in the Caffe tutorial (i.e., LeNet for MNIST and AlexNet for CIFAR-10).

For the training process, the weights are initialized randomly and updated by mini-batch SGD. We use the model in the last iteration as our final model. For DNN-$C$, all the hyper parameters are set by following the Caffe tutorial. For LMDNNs, all the hyper parameters are tuned on the validation set. Finally, we find that by using the following hyper parameters, both DNN-$C$ and LMDNNs can achieve the best performance as we reported. For MNIST, we set the batch size as 64, the momentum as 0.9, and the weight decay coefficient as 0.0005. Each neural network is trained for 10k iterations, and the learning rate in iteration $T$ is obtained by multiplying the initial learning rate by a factor of $(1 + 0.0001T)^{-0.75}$. For CIFAR-10, we set the batch size as 100, the momentum as 0.9, and the weight decay coefficient as 0.004. Each neural network is trained for 70k iterations. The learning rate is set to $10^{-3}$ for the first 60k iterations, $10^{-4}$ for the next 5k iterations, and $10^{-5}$ for the remaining 5k iterations.
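The width allocation of Appendix A and the learning-rate schedules of Appendix B can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, iterations are assumed to be counted from 0, and the initial MNIST learning rate is left as a parameter because its value is not stated in the text.

```python
def units_per_hidden_layer(total_hidden_units, depth):
    """Equal split of the hidden units over the depth - 1 hidden layers."""
    return total_hidden_units // (depth - 1)

def mnist_learning_rate(initial_lr, iteration):
    """Initial rate multiplied by (1 + 0.0001 * T) ** -0.75 at iteration T."""
    return initial_lr * (1.0 + 0.0001 * iteration) ** -0.75

def cifar10_learning_rate(iteration):
    """Step schedule: 1e-3 for 60k iterations, then 1e-4 for 5k, then 1e-5."""
    if iteration < 60000:
        return 1e-3
    if iteration < 65000:
        return 1e-4
    return 1e-5
```

With 3000 total hidden units, depths 2 through 6 give 3000, 1500, 1000, 750 and 600 units per hidden layer, matching the allocation listed in Appendix A.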













